ABSTRACT

Videos contain very rich semantic information. Traditional
hand-crafted features are known to be inadequate in analyzing
complex video semantics. Inspired by the huge success of the deep
learning methods in analyzing image, audio and text data,
significant efforts are recently being devoted to the design of
deep nets for video analytics. Among the many practical needs,
classifying videos (or video clips) based on their major semantic
categories (e.g., “skiing”) is useful in many applications. In
this paper, we conduct an in-depth study to investigate important
implementation options that may affect the performance of deep
nets on video classification. Our evaluations are conducted on
top of a recent two-stream convolutional neural network (CNN)
pipeline, which uses both static frames and motion optical flows,
and has demonstrated competitive performance against the
state-of-the-art methods. In order to gain insights and to arrive
at a practical guideline, many important options are studied,
including network architectures, model fusion, learning
parameters and the final prediction methods. Based on the
evaluations, very competitive results are attained on two popular
video classification benchmarks. We hope that the discussions and
conclusions from this work can help researchers in related fields
to quickly set up a good basis for further investigations along
this very promising direction.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology; H.3.1
[Information Storage and Retrieval]: Content Analysis
and Indexing

General Terms

Algorithms, Measurement, Experimentation.

1. INTRODUCTION

With the popularity of video recording devices and content
sharing activities, there is a strong need for techniques that
can automatically analyze the huge scale of video data. Video
classification serves as a fundamental and essential step in the
process of analyzing the video contents. For example, it would be
extremely helpful if the massive consumer videos on the Web could
be automatically classified into predefined categories. Learning
semantics from the complicated video contents is never an easy
task, and methods based on traditional hand-crafted features and
prediction models are known to be inadequate [^.

In recent years, deep learning based models have been proved to
be more competitive than the traditional methods on solving
complex learning problems in various domains. For example, the
deep neural network (DNN) has been successfully used for acoustic
modeling in the large vocabulary speech recognition problems [^.
Moreover, the deep learning based methods have been shown to be
extremely powerful in the image domain. In 2012, Krizhevsky et
al. were the first to use a completely end-to-end deep
convolutional neural network (CNN) model to win the famous
ImageNet Large Scale Visual Recognition Challenge (ILSVRC),
outperforming all the traditional methods by a large margin. In
the most recent 2014 edition of the ILSVRC, Szegedy et al.
developed an improved and deeper version of the CNN, which
further reduced the top-5 label error rate to just 6.7% over one
thousand categories. In the text domain, deep models have also
been successfully used for sentence parsing, sentiment prediction
and language translation problems [27].

On video data, however, deep learning often demonstrated worse
results than the traditional techniques [8, 12]. This is largely
due to the difficulties in modeling the unique characteristics of
videos. On one hand, the spatial-temporal nature demands more
complex network structures and possibly also more advanced
learning methods. On the other hand, so far there is a very
limited amount of training data with manual annotations in the
video domain, which limits the progress of developing new methods
as neural networks normally require extensive training. Very
recently, Simonyan et al. proposed the two-stream CNN, an
effective approach that trains two CNNs using static frames and
temporal motion separately [^. The temporal motion stream is
converted to successive optical flow images so that the
conventional CNN designed for images can be directly deployed.

Although promising results were observed in [^, we underline
that the performance of deep learning in video classification is
subject to many implementation options, and there are no in-depth
and systematic investigations on this in the field. In this
paper, we conduct extensive experiments on two popular benchmarks
to evaluate several important options, including not only network
structures and learning parameters, but also model fusion that
combines results from different networks and prediction
strategies that map network outputs to classification labels. The
two-stream CNN approach is adopted as the basic pipeline for the
evaluations in this work. By evaluating the implementation
options, we intend to answer the question of what network
settings and implementation options are likely to produce good
video classification results. As implementing a deep learning
based system for video classification is a very difficult task,
we hope that the discussions and conclusions from this work are
helpful for researchers in this field and can stimulate future
studies.

The rest of this paper is organized as follows. Section 2
reviews related works. In Section 3, we briefly introduce the
two-stream CNN approach. Section 4 discusses the evaluated
implementation options and Section 5 reports and analyzes
experimental results. Finally, Section 6 summarizes the findings
in this work.


2. RELATED WORKS 

Extensive studies have been conducted on video classification in
the multimedia and computer vision communities. State-of-the-art
video classification systems are usually built on top of multiple
discriminative feature representations. To achieve better
performance, various features have been developed. For instance,
Laptev et al. extended the traditional SIFT features to obtain
the Space-Time Interest Points (STIP) by finding representative
tubes in 3D space [^. Wang et al. proposed the dense trajectory
features, which densely sample local patches from each frame at
different scales and then track them in a dense optical flow
field over time [^. Besides the feature descriptors, one can
obtain further improvements by adopting advantageous feature
encoding strategies like the Fisher Vector or utilizing fusion
techniques [3^, 33] to integrate information from different
features.

The aforementioned hand-crafted features like the dense
trajectories have demonstrated state-of-the-art performance on
many video classification tasks. However, these features are
still unsatisfying and the room for further improvements may be
limited. In contrast to the hand-crafted features, there is a
growing trend of learning features directly from raw data using
deep learning methods, among which the CNN has attracted wide
attention due to its great success in image classification
[^[^, visual object detection [^, etc.

Compared with the extensive studies on using deep learning for
image analysis, only a few works have exploited this approach for
video analysis. Ji et al. and Karpathy et al. extended the CNN
into the temporal domain by stacking static frames, upon which
convolution can be performed with spatial-temporal filters [^.
However, the learned representations from these methods produced
worse results than the state-of-the-art hand-crafted features
like the dense trajectories [^. More recently, Simonyan et al.
achieved very competitive performance by training two CNNs on
spatial (static frames) and temporal (optical flows) streams
separately and then fusing the two networks.


Most CNN-based approaches rely on the neural networks to perform
the final class label prediction, normally using a softmax layer
or a linear layer [^. Instead of direct prediction by the
network, Jain et al. conducted action recognition using support
vector machines (SVMs) with features extracted from off-the-shelf
CNN models [^. Their impressive results in the THUMOS action
recognition challenge indicate that CNN features are very
powerful. In addition, a few works attempted to apply the CNN
representations with Recurrent Neural Network (RNN) models to
capture the temporal information in videos and perform
classification within the same network. Donahue et al. leveraged
the Long Short-Term Memory (LSTM) RNN model for action
recognition, and Venugopalan et al. proposed to translate videos
directly to sentences with the LSTM model by transferring
knowledge from image description tasks. The RNN shares the same
motivation as the temporal pathway in the two-stream framework
[24].

In this paper, we provide an in-depth study on various
implementation choices of deep learning based video
classification. On top of the two-stream CNN pipeline [^, we
examine the performance of the spatial and temporal streams
separately and jointly with different network architectures under
various sets of parameters. In addition, we also examine the
effect of different prediction options, including directly using
a CNN with a softmax layer for end-to-end classification and
adopting SVMs on the features learned from the CNNs. The RNN
models are not considered since they have not been fully explored
and only demonstrated limited performance gain in the context of
video classification. This paper fills the gap in the existing
works on deep learning for video classification, where most
people have focused on designing a new classification pipeline or
a deeper network structure without systematically evaluating and
discussing the implementation options.


3. TWO-STREAM CNN 

According to the findings in neuroscience, the human visual
system processes what we see through two kinds of streams, namely
the ventral pathway and the dorsal pathway. The ventral pathway
is responsible for processing the spatial information, such as
shape and color, while the dorsal pathway is responsible for
processing the motion information [14, 5]. Mimicking the human
vision mechanism, the video data, i.e., sequential image frames,
can be naturally decomposed into the spatial and temporal
components in a similar way. More specifically, image content
including scenes and objects belongs to the spatial component.
The complementary temporal component contains motion information
across video frames. As an early attempt, Schindler et al.
extracted spatial and temporal features on still images and
optical flows by using Gabor filters [^. With the recent success
of the CNN, as briefly mentioned earlier, Simonyan et al.
promoted this framework by the two-stream CNN structure that
models both the spatial and the temporal information [^. Figure 1
shows the processing pipeline of this approach. One CNN is
applied to process the spatial stream of the video data, and the
other CNN is used to handle the temporal stream.

Figure 1: The processing pipeline of the two-stream CNN [^, which
has demonstrated competitive video classification performance.
“Conv.” indicates “Convolution” and “FC” is the abbreviation of
“Fully Connected”. Our evaluations in this work are conducted on
top of this pipeline.

The data flow of the spatial stream is shown at the top of
Figure 1. This CNN has the same structure as a general deep CNN
for image classification [13]. It directly takes individual video
frames as network inputs, followed by several convolutional
layers, pooling layers and fully connected (FC) layers. Finally,
after a softmax layer, the network outputs in the range of [0, 1]
are taken as the predicted probabilities of the video classes.
Multiple frames are sampled from each input video. The network
processes one frame at a time and the predictions on individual
frames are merged simply by average probability fusion.
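
As an illustration of the frame-level merging step, the following
is a minimal sketch of average probability fusion in plain Python.
The function names and the toy logits are hypothetical and are not
taken from the evaluated implementation; the sketch only assumes
that each sampled frame yields one vector of raw class scores.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probability_fusion(frame_logits):
    """Average per-frame class probabilities into a video-level prediction."""
    frame_probs = [softmax(logits) for logits in frame_logits]
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    return [sum(p[c] for p in frame_probs) / n_frames for c in range(n_classes)]


# Hypothetical network outputs for 3 sampled frames and 3 classes.
frame_logits = [[2.0, 0.5, 0.1], [1.5, 1.4, 0.2], [0.2, 3.0, 0.1]]
video_probs = average_probability_fusion(frame_logits)
predicted_class = max(range(len(video_probs)), key=video_probs.__getitem__)
```

Because each per-frame vector sums to one, the averaged vector is
again a valid probability distribution over the classes.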

For the temporal stream, in order to capture the motion
information, the temporal CNN takes stacked optical flows as
input, rather than the frames in the spatial stream.
Specifically, dense optical flow can be seen as a set of
displacement vector fields between consecutive frames. For a
given pair of consecutive frames, the horizontal and vertical
components of the calculated displacement vector fields between
them can be used to generate two “optical flow images”,
respectively. To further consider temporal information, one can
stack the optical flow images of each frame at time t and its L
subsequent frames into a stacked 2L-channel optical flow image
(in contrast to the traditional 3-channel RGB images). The
network architecture and training process of the temporal CNN are
basically the same as the spatial counterpart, except that the
input images have a different number of channels. There are
multiple stacked 2L-channel optical flow images in a video. The
way of fusing predictions on these individual images is also the
same as that of the spatial stream.
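
The stacking step can be sketched as follows. This is only an
illustrative sketch: the function name, the interleaved
horizontal/vertical channel order, the toy 2 x 2 flow fields and
the value L = 10 are assumptions made for the example, not settings
stated in the text.

```python
def stack_optical_flow(flow_x, flow_y, t, L):
    """Build one 2L-channel input from L consecutive flow fields starting at t.

    flow_x[i] and flow_y[i] are H x W grids (lists of rows) holding the
    horizontal and vertical displacement between frames i and i + 1.
    """
    if t < 0 or t + L > len(flow_x) or len(flow_x) != len(flow_y):
        raise ValueError("not enough flow fields to stack L of them from t")
    channels = []
    for i in range(t, t + L):
        channels.append(flow_x[i])  # horizontal component (assumed order)
        channels.append(flow_y[i])  # vertical component
    return channels


# Toy example: 12 flow fields of size 2 x 2 filled with placeholder values.
num_flows, H, W = 12, 2, 2
flow_x = [[[float(i)] * W for _ in range(H)] for i in range(num_flows)]
flow_y = [[[-float(i)] * W for _ in range(H)] for i in range(num_flows)]

stacked = stack_optical_flow(flow_x, flow_y, t=0, L=10)  # 20 channels
```

Sliding t over the video yields the multiple stacked inputs whose
predictions are then averaged.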

We now have predictions from the two CNNs, based on the spatial
and temporal streams separately. The last step is to combine the
two streams to produce the final output. For this, linear fusion
with fixed weights was used in [24].
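
A minimal sketch of such a linear fusion is given below, assuming
the temporal weight is set to one minus the spatial weight. The
function name and the example probabilities are hypothetical.

```python
def linear_fusion(spatial_probs, temporal_probs, spatial_weight):
    """Fuse two class-probability vectors with a fixed convex weight."""
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")
    if len(spatial_probs) != len(temporal_probs):
        raise ValueError("both streams must score the same classes")
    temporal_weight = 1.0 - spatial_weight
    return [
        spatial_weight * s + temporal_weight * t
        for s, t in zip(spatial_probs, temporal_probs)
    ]


# Hypothetical video-level predictions of the two streams for 3 classes.
spatial = [0.6, 0.3, 0.1]
temporal = [0.2, 0.7, 0.1]

# Sweep the spatial weight from 0.0 (temporal only) to 1.0 (spatial only).
sweep = {w / 10: linear_fusion(spatial, temporal, w / 10) for w in range(11)}
fused_equal = sweep[0.5]
```

Sweeping the weight on held-out data is one simple way to choose
it instead of fixing it in advance.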

4. IMPLEMENTATION OPTIONS 

In this section, we discuss several important implementation
options that can affect the performance of deep learning for
video classification, including network architectures, fusion
strategies, learning parameters and prediction options.

4.1 Network Architectures 

Network architecture plays a very important role in the
performance of a deep learning model. The current surge of the
CNNs in many tasks heavily relies on the use of superior network
architectures, such as the AlexNet and the GoogLeNet [^. Popular
CNN architectures usually have alternating convolutional layers
and auxiliary layers (e.g., pooling layers), and are topped by a
few fully connected (FC) layers. Recent studies demonstrate that
better recognition accuracies can be achieved by deepening the
CNN architectures [25, 30], which means that deeper architectures
can lead to progressively more discriminative features at higher
layers. To evaluate the performance of different network
architectures, in this work, we adopt and evaluate a medium
network structure, CNN_M [^, and a very recent deeper network
architecture, VGG_19 [25]. See Table 1 for the detailed
configurations of CNN_M and VGG_19.


CNN_M Network. 

CNN_M basically follows the same spirit as the widely adopted
AlexNet [^. It contains five convolutional layers followed by
three FC layers and the input image is fixed to the size of
224 x 224. Compared with AlexNet [^, CNN_M possesses more
convolutional filters. On the first convolutional layer, the
filter size and stride are both smaller (7 x 7 and 2
respectively) than those in AlexNet, while the remaining
convolutional layers have the same filter size and stride. By
increasing the number of filters and reducing the filter size and
the stride step, CNN_M can discover more subtle information from
input images, and hence can obtain more robust feature
representations and better predictions. Note that CNN_M achieves
a 13.5% top-5 error on the ILSVRC-2012 validation set (a famous
image recognition benchmark), which is generally considered as a
good result.


VGG_19 Network. 

VGG_19 not only further reduces the size of convolutional filters
and the stride, but more importantly, it also extends the depth
of the network. More precisely, VGG_19 consists of nineteen
layers, including sixteen convolutional layers and three fully
connected layers. In addition, the size of all the convolutional
filters decreases to 3 x 3 and the stride reduces to only 1
pixel, which enables the network to explore finer-grained details
from the feature maps. With this much deeper architecture, VGG_19
possesses strong capabilities of learning more discriminative
features and the high-level final predictions. It can produce a
7.1% top-5 error rate on the ILSVRC-2012 validation set [2^.
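
To make the effect of filter size, stride and padding concrete,
the sketch below applies the standard output-size formula for a
convolutional layer to the first layers of the two networks, using
the 224 x 224 input and the layer settings given in the text. The
helper names and the use of floor rounding are assumptions of this
sketch (rounding conventions can differ between frameworks).

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution (floor rounding assumed)."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def stacked_receptive_field(num_layers, kernel_size):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


# First layer of CNN_M: 7 x 7 filters, stride 2, padding 0.
cnn_m_conv1 = conv_output_size(224, kernel_size=7, stride=2, padding=0)

# First layer of VGG_19: 3 x 3 filters, stride 1, padding 1.
vgg_19_conv1 = conv_output_size(224, kernel_size=3, stride=1, padding=1)

# Two and three stacked 3 x 3 layers cover 5 x 5 and 7 x 7 regions.
rf_two = stacked_receptive_field(2, 3)
rf_three = stacked_receptive_field(3, 3)
```

With stride 1 and padding 1, a 3 x 3 convolution preserves the
spatial resolution, so VGG_19 only reduces resolution in its
pooling layers, whereas the first CNN_M layer roughly halves it.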

4.2 Fusion Strategies 

Fusing multiple clues is a standard technique in video analysis,
which can often lead to better performance. We divide this part
into model fusion and modality fusion.

CNN_M                                 VGG_19
------------------------------------  ------------
96 x 7 x 7 (stride: 2, padding: 0)    64 x 3 x 3
LRN                                   64 x 3 x 3
x2 pooling                            x2 pooling
256 x 5 x 5 (stride: 2, padding: 1)   128 x 3 x 3
LRN                                   128 x 3 x 3
x2 pooling                            x2 pooling
512 x 3 x 3 (stride: 1, padding: 1)   256 x 3 x 3
512 x 3 x 3 (stride: 1, padding: 1)   256 x 3 x 3
512 x 3 x 3 (stride: 1, padding: 1)   256 x 3 x 3
x2 pooling                            256 x 3 x 3
                                      x2 pooling
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      x2 pooling
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      512 x 3 x 3
                                      x2 pooling
------------------------------------  ------------
4,096 neurons                         4,096 neurons
2,048 neurons                         4,096 neurons
softmax                               softmax

Table 1: Configurations of two networks: CNN_M and VGG_19. The
convolutional kernel parameter is denoted in the form “filters x
kernel size (x) x kernel size (y)”. For both networks,
max-pooling with a sampling factor of 2 is used (“x2 pooling”).
In VGG_19, both stride and padding are set to 1 for all the
convolutional layers. The VGG_19 does not contain the Local
Response Normalization (LRN), which requires considerable
computation but contributes little to the performance.

4.2.1 Model Fusion 

Combining various deep learning models can usually produce better
performance than using just a single model, because models using
different architectures or trained with different parameters may
contain complementary information. For instance, a model trained
with larger convolutional filters may focus more on large
patterns, while one with smaller filters may be more sensitive to
finer-grained details. Since both CNN architectures, CNN_M and
VGG_19, can be trained with multiple parameter settings, there
are several candidate models that can be exploited in this
experiment. We examine different combinations to integrate
information from these candidate models.

4.2.2 Modality Fusion 

As mentioned earlier, videos are naturally multimodal, and hence
the integration of the spatial and the temporal streams is very
important. Simonyan et al. adopted a simple linear fusion method
with fixed weights, without explanation. We will examine the
effect of this fusion weight on two very different datasets.
Notice that another very important modality, the audio channel,
is not considered in this work because we focus on examining
options of training models using only the visual data following
the two-stream pipeline. However, we strongly believe that better
performance can be achieved by further including the audio
information, because this has been observed in many recent works
[10].

4.3 Learning Parameters 

Although deep learning has achieved promising results on many
tasks, training and fine-tuning a good model normally require
significant efforts as there are several important parameters
that need to be evaluated, such as learning rate, dropout ratio
and the number of training iterations. These seemingly
arbitrarily chosen parameters can influence the performance
significantly. For instance, a small learning rate may demand
many more iterations to converge, while a large value may
accelerate the convergence but can possibly result in
oscillation. In addition, a larger dropout ratio may lead to a
better model but could slow down the convergence. There is no
universal rule for parameter selection. With the goal of
providing some insights on parameter selection specifically for
the problem of video classification, we study different sets of
parameters using the two aforementioned network architectures and
two datasets.

In particular, the dropout ratio and the number of iterations are
jointly evaluated and discussed. For learning rate, we set it to
10“^ initially, and then decreased to 10“^ after 100K iterations,
then to 10“^ after 200K iterations. In the fine-tuning case, the
rate starts from 10“^ and decreases to after 14K iterations, then
to after 20K iterations. This setting is similar to [M, but we
start from a smaller rate of instead of 10~^. Note that other
choices on learning rate are not evaluated as we find that the
final performance is less sensitive to this parameter as long as
it is set following the suggested rules in [^.

4.4 Prediction Options 

While neural networks can act as end-to-end classifiers by using
a final softmax layer, traditional classifiers like the SVMs can
also be deployed on the features extracted by the CNN, which are
generally the outputs of the last several fully connected layers.
Recently, Razavian et al. adopted features extracted from a CNN
model pre-trained on ImageNet to perform classification with
SVMs. They demonstrated strong performance on image analysis
tasks like scene recognition, object detection, etc. In addition,
Jain et al. leveraged the CNN features using SVMs for action
recognition in videos, and achieved superior performance on the
THUMOS action recognition benchmark [^. The results suggest that
the CNN features may be used in combination with traditional
classifiers for improved performance, but these existing works
were performed only on images or the spatial frames. This paper
investigates the performance of features extracted from different
layers of both the spatial and the temporal CNNs using SVMs for
classification. Results are compared with those of the end-to-end
neural network based approach.

5. EXPERIMENTS 

5.1 Datasets and Evaluation Criteria 

UCF-101 [^. The UCF-101 dataset is a widely adopted benchmark for
action recognition in videos, which consists of 13,320 video
clips (27 hours in total). There are 101 annotated classes that
can be divided into five types: Human-Object Interaction,
Body-Motion Only, Human-Human Interaction, Playing Musical
Instruments and Sports. We perform evaluation according to the
popular 3 train/test splits following [^. Results are measured by
classification accuracy on each split and mean accuracy over the
3 splits. For some evaluations, we only report results on the
first split due to computation limitation.

Columbia Consumer Videos (CCV) [^. The CCV dataset contains 9,317
YouTube videos annotated according to 20 classes, such as
“wedding ceremony”, “birthday party”, “skiing” and “playground”.
We follow the predefined protocol, which uses a training set of
4,659 videos and a test set of 4,658 videos. Results are measured
by average precision (AP) for each class and mean AP (MAP) across
all the classes [^. Note that, different from the UCF-101
actions, most classes in CCV are social events, sports, objects
and scenes. In addition, the average video duration in this
dataset is around 80 seconds, which is over ten times longer than
that of the UCF-101. We hope that using the two datasets with
different characteristics can help lead to more generalizable
conclusions.

For both datasets, we adopt the same data augmentation
strategies as [24].
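
The two evaluation criteria can be sketched as follows. The text
does not spell out which AP variant is used, so this sketch
assumes the common non-interpolated AP (the mean of the precision
values at the rank of each positive test video); all function names
and toy values are hypothetical.

```python
def accuracy(predicted, actual):
    """Fraction of videos whose predicted class equals the ground truth."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def average_precision(scores, labels):
    """Non-interpolated AP for one class (labels are 1 or 0)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class_scores, per_class_labels):
    """Mean of the per-class AP values."""
    aps = [
        average_precision(s, l)
        for s, l in zip(per_class_scores, per_class_labels)
    ]
    return sum(aps) / len(aps)


# Toy values: 4 test videos, 2 classes.
acc = accuracy([0, 1, 2, 2], [0, 1, 1, 2])
ap_class_0 = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
map_score = mean_average_precision(
    [[0.9, 0.8, 0.3, 0.1], [0.1, 0.7, 0.2, 0.9]],
    [[1, 0, 1, 0], [0, 1, 0, 1]],
)
```

Accuracy suits UCF-101, where each clip has exactly one label,
while AP evaluates each CCV class as a separate ranking problem.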

5.2 Network Options 

Which network structure to use is the first decision we have to
make in the implementation of a deep learning based video
classification system. There are numerous options. In this work,
we evaluate and compare the two popular structures CNN_M and
VGG_19.

Results of the spatial stream are summarized in Table 2. As can
be seen, VGG_19 produces consistently better results on both
datasets, indicating that larger (deeper) networks are generally
better. This is consistent with the observations on the large
scale image classification tasks. Results of different dropout
rates are listed in this table because this can offer a more
complete understanding of the power of the networks under
different settings of learning parameters. Detailed discussions
on the effect of dropout rates will be given later.

For the temporal stream, we also tried to use the VGG_19 network
under a few parameter settings, but observed clearly worse
results than the CNN_M. As the gap is clear, we did not proceed
to finish all the parameter settings to fully compare the two
networks. The key reason is that the temporal stream has to be
trained from scratch, which is different from the spatial stream
that can use fine-tuning to adjust the pre-trained network based
on millions of images [^. In this case, learning a smaller
temporal network is more feasible with limited training data. We
expect that better results can be achieved by VGG_19 for the
temporal stream when there is sufficient training data.

5.3 Model and Modality Fusion 

Next, we discuss results by fusing models and modalities. For the
combination of different models, we use the spatial stream and 10
network models (2 structures each trained with 5 dropout rates).
We tried all the possible model combinations by averaging their
prediction scores and identified the top 3 results in order to
learn which model is more reliable and what combinations are
good. Results are shown in Table 3. We see that VGG_19 models are
more “popular” in the top combinations on both datasets,
confirming the fact that fusing good models generally offers good
results. However, comparing the model fusion results with the
individual model results in Table 2, it is clear that fusing
models does not improve results significantly. For instance, on
UCF-101, the 2nd best result, obtained by fusing five models, is
actually the same as that of just using the best single model
(VGG_19 dr5). Therefore, one conclusion is that fusing a strong
network (VGG_19) with a relatively weaker network (CNN_M) does
not help much (and may even make results worse). Notice that
average model prediction fusion is adopted here. Using dynamic
fusion weights may lead to better results, but the gain is
unlikely to be significant.

Models        UCF-101 (split-1)   CCV
CNN_M dr1     71.58%              68.78%
CNN_M dr3     68.65%              68.81%
CNN_M dr5     68.25%              68.64%
CNN_M dr7     68.15%              67.40%
CNN_M dr9     60.85%              51.81%
VGG_19 dr1    75.87%              74.66%
VGG_19 dr3    79.59%              74.47%
VGG_19 dr5    80.41%              75.04%
VGG_19 dr7    76.66%              74.90%
VGG_19 dr9    76.39%              73.09%

Table 2: Spatial stream results of two network architectures on
UCF-101 and CCV under different dropout rates (“dr1” indicates
the 0.1 dropout rate). See text for discussions.

              UCF-101 (split-1)        CCV
Models        1      2      3          1      2      3
CNN_M dr1     /      /      -          -      -      /
CNN_M dr3     -      -      -          -      -      -
CNN_M dr5     -      -      -          -      -      -
CNN_M dr7     -      -      -          -      -      -
CNN_M dr9     -      /      /          /      -      -
VGG_19 dr1    -      -      -          -      -      /
VGG_19 dr3    /      /      -          -      /      /
VGG_19 dr5    /      /      /          /      /      -
VGG_19 dr7    -      -      -          /      /      /
VGG_19 dr9    /      /      -          -      -      /
Perf. (%)     80.46  80.41  80.33      75.31  75.31  75.30

Table 3: Top-3 spatial stream model fusion results on both
datasets. The “/” sign indicates the used models in each of the
top combinations (“-” marks an unused model). A general
observation is that fusing good models like the VGG_19 based ones
tends to generate good results, but the improvement is fairly
limited.
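
The exhaustive search over model combinations can be sketched as
below. This is a sketch under stated assumptions rather than the
code used for the experiments: the model names and toy scores are
hypothetical, only combinations of two or more models are
enumerated, and ties are broken in favor of smaller combinations.

```python
from itertools import combinations


def fuse_average(score_sets):
    """Average the per-video class scores of several models."""
    n_models = len(score_sets)
    return [
        [sum(model[v][c] for model in score_sets) / n_models
         for c in range(len(score_sets[0][v]))]
        for v in range(len(score_sets[0]))
    ]


def accuracy(video_scores, labels):
    """Accuracy of the arg-max class over all videos."""
    predicted = [max(range(len(s)), key=s.__getitem__) for s in video_scores]
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def rank_model_combinations(model_scores, labels, top_k=3):
    """Score every combination of two or more models; return the best ones."""
    names = sorted(model_scores)
    results = []
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            fused = fuse_average([model_scores[name] for name in combo])
            results.append((accuracy(fused, labels), combo))
    # Highest accuracy first; ties go to the smaller combination.
    results.sort(key=lambda r: (-r[0], len(r[1]), r[1]))
    return results[:top_k]


# Toy scores of three hypothetical models on 4 videos and 2 classes.
labels = [0, 1, 0, 1]
model_scores = {
    "A": [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.2, 0.8]],
    "B": [[0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.3, 0.7]],
    "C": [[0.3, 0.7], [0.2, 0.8], [0.6, 0.4], [0.6, 0.4]],
}
top = rank_model_combinations(model_scores, labels)
```

With 10 candidate models this enumerates 1,013 combinations, which
remains cheap because only stored prediction scores are averaged
and no network is re-evaluated.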

For modality fusion, we use the best spatial network
configurations to fuse with a temporal network trained using the
CNN_M architecture. Results of different fusion weights are
plotted in Figure 2. Comparing the temporal stream with the
spatial counterpart, the latter produces better results on both
datasets. The gap is larger on CCV as its categories are easier
to recognize by viewing just one or a few frames, e.g., the
sports classes “basketball” and “skiing”. Fusing the two
modalities is effective, leading to significant improvement on
UCF-101 (best fusion: 86.7%; spatial: 80.5%; temporal: 78.3%).
The gain on CCV is limited as the result of the temporal stream
is not good (best fusion: 75.9%; spatial: 75.3%; temporal:
59.4%), which is generally consistent with the results of the
hand-crafted features on this dataset (on CCV, it was reported
that static frame features are significantly better than
spatial-temporal features). As for the suitable fusion weight,
the results indicate that similar or higher temporal weight is
preferred for classes that can be better recognized by viewing a
clip (not just a frame), even when the temporal stream performs
worse than the spatial stream.

Figure 2: Performance (%) of combining spatial and temporal
streams using linear fusion with different weights. Temporal
stream weight is set as “1 - spatial weight”. Thus the left-end
points of the curves indicate the performance of using the
temporal stream alone, while the right-end is the spatial stream
performance.


5.4 Effect of Learning Parameters

We jointly evaluate the effect of two learning parameters: the
dropout ratio and the number of training iterations. We first
study the spatial stream. Figure and Figure [^ visualize the
results on UCF-101 and CCV respectively. Notice that the spatial
networks are fine-tuned based on the models pre-trained on
ImageNet (only the last three FC layers are fine-tuned), and
therefore they start from a fairly good initialization and become
stable quickly after just 10-20 thousand iterations. With the
uniform settings on learning rate (cf. Section 4.3), the number
of iterations required to reach convergence is similar across
different network architectures and dropout ratios. One
interesting observation is that large dropout ratios (e.g., 0.9)
are especially unsuited for small networks like the CNN_M. This
is probably because a small network can hardly learn anything if
as much as 90% of the information is dropped at each iteration,
particularly for the case of fine-tuning where only the last
three layers are adjusted.

Figure [^ plots the temporal stream results on both datasets,
using the CNN_M architecture with various learning parameters. We
observe that a large dropout ratio requires more iterations to
reach a high level of performance, which is easy to understand.
Different from the spatial stream observations on CNN_M, a large
dropout ratio can also lead to comparable results. This may be
because the temporal stream networks are trained from scratch,
and, even using the same architecture, training the entire
framework is more complex than fine-tuning (only tuning three FC
layers). Furthermore, some researchers expressed the view that
dropout can be considered as a form of training set augmentation
[23]. Complex networks may be more suitable to learn from highly
augmented input data.

                             Spatial (VGG_19)   Temporal (CNN_M)   Spatial-Temporal Fusion
Dataset            Feature   Early    Late      Early    Late      Early    Late
UCF-101 (split-1)  FC1       75.84%   70.71%    78.22%   76.74%    87.55%   82.34%
UCF-101 (split-1)  FC2       72.75%   64.42%    77.85%   76.24%    85.94%   80.52%
UCF-101 (split-1)  FC1&FC2   75.73%   70.29%    78.30%   76.63%    87.71%   82.00%
CCV                FC1       70.75%   67.34%    58.04%   54.08%    73.25%   68.87%
CCV                FC2       70.45%   68.85%    55.86%   52.52%    72.76%   70.06%
CCV                FC1&FC2   71.15%   69.25%    58.79%   54.40%    73.27%   69.90%

Table 4: Prediction results of SVM classifiers on the CNN
features. “Early” and “Late” denote early fusion and late fusion
of frames. FC1 (FC2) indicates the output feature of the first
(second) fully connected layer. “FC1&FC2” is the concatenation of
the FC1 and FC2 features.


5.5 Softmax vs. SVMs 

As discussed in Section 4.3, neural networks can be used 
either as an end-to-end classifier or as a feature extractor. In 
this section, we discuss the results of using SVMs on the CNN 
features and compare them with softmax. We train linear SVMs 
using the outputs of the first and the second FC layers. VGG_19 
and CNN_M are adopted for the spatial and temporal streams 
respectively. As each video has multiple spatial frames and 
stacked optical flow images, there are two ways to train the 
SVM classifiers. One is early fusion, which averages the features 
of all the frames before classifier training and testing. 
The other is late fusion, which takes all the frames as 
separate inputs and uses the average prediction score as the 
final video-level score. 
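
The two ways of preparing SVM inputs can be sketched as below. This is a simplified illustration: the helper names, the toy features, and the fixed linear scorer (standing in for a trained linear SVM) are assumptions, and no SVM is actually trained here.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def early_fusion_samples(videos):
    """One sample per video: the average of its frame features."""
    return [(mean_vector(frames), label) for frames, label in videos]


def late_fusion_samples(videos):
    """One sample per frame, each inheriting the label of its video."""
    return [(frame, label) for frames, label in videos for frame in frames]


def linear_score(weights, bias, feature):
    """Decision value of a linear classifier (stand-in for a trained linear SVM)."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def video_score_early(weights, bias, frames):
    """Early fusion at test time: average the features, then score once."""
    return linear_score(weights, bias, mean_vector(frames))


def video_score_late(weights, bias, frames):
    """Late fusion at test time: score every frame, then average the scores."""
    scores = [linear_score(weights, bias, frame) for frame in frames]
    return sum(scores) / len(scores)


# Hypothetical data: two videos with 2-D frame features and binary labels.
toy_videos = [([[1.0, 0.0], [3.0, 2.0]], 1), ([[0.0, 1.0]], 0)]
print(early_fusion_samples(toy_videos))
print(late_fusion_samples(toy_videos))
```

For a fixed linear scoring function, averaging features before scoring and averaging raw scores afterwards give the same video-level value. In this sketch, therefore, the two settings differ only in the training set they produce: one averaged sample per video versus one sample per frame.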

The accompanying table summarizes the results of the spatial 
stream, the temporal stream, and their fusion on both datasets. 
Compared with softmax-based prediction, SVMs are only slightly 
better on UCF-101, under the spatial-temporal modality fusion 
setting with the early frame fusion method. On CCV, the SVM 
performance is significantly lower than that of softmax. Comparing 
early frame fusion with late fusion, early fusion performs 
consistently well, indicating that averaging frame features before 
classification may help suppress noise that affects SVM training. 
Interestingly, the neural networks take individual frames as 
inputs, like the late-fusion-based SVMs, but are quite robust. 
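
For comparison, a softmax-based video-level prediction from per-frame network outputs could look as follows. The averaging of per-frame probabilities is an assumption made for this sketch, as are the function names and the toy logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class logits of one frame."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_probabilities(frame_logits):
    """Average the per-frame softmax outputs into one video-level distribution."""
    per_frame = [softmax(logits) for logits in frame_logits]
    count = len(per_frame)
    return [sum(column) / count for column in zip(*per_frame)]


# Hypothetical logits for three frames of one video over three classes.
toy_logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.3, 0.2, 0.1]]
video_probs = video_probabilities(toy_logits)
predicted_class = max(range(len(video_probs)), key=lambda i: video_probs[i])
print(video_probs, predicted_class)
```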

5.6 Comparison with the State of the Art 

Finally, we compare our results with the state of the art 
on both datasets. For UCF-101, we report average accuracies 
over the three official train-test splits. As shown in 
Table 5, our results are competitive on both datasets. The 
performance on CCV is significantly better than the state 
of the art, all of which adopted traditional features like 
dense trajectories [?]. On UCF-101, our results are comparable 
to those of a very recent work [?], which uses an extensive 
fusion approach on top of state-of-the-art hand-crafted features. 
Our results are also similar to those of [?]. Note 
that the 88.0% reported in that work was attained by using 
external training data from another human action benchmark. 
When trained only on UCF-101, its performance is lower. 

Figure 3: Spatial stream results on UCF-101 (split-1), under different dropout ratios (from 0.1 to 0.9) and iteration numbers. 

Figure 4: Spatial stream results on CCV, under different dropout ratios and iteration numbers. 

Figure 5: Temporal stream results on both UCF-101 (split-1) and CCV, using CNN_M with different dropout ratios and iteration numbers. 

6. SUMMARY AND DISCUSSION 

Building a deep learning system for video classification is 
not an easy task. We have evaluated several important 
implementation options. The major findings are summarized 
below. Note that knowing what does not work is as 
important as knowing what works. 

For network architectures, one observation is that deeper 
networks like VGG_19 are better, but a sufficient amount 
of training data is required. This is fine for the spatial 
stream, as image annotations like those of ImageNet can be 
used to pre-train the network. For the temporal stream, as 
we cannot use image collections for model pre-training, 
the results of very deep networks are not good. We envision 
that they would work well on the temporal stream once 
a large amount of training data is available in the video domain. 
Ours - Softmax 
87.7% 
Table 5: Comparison with the state-of-the-art results. 


Results indicate that fusing multiple network models is 
not very helpful, especially when combining a strong network 
with a weak one. Fusing two networks with similar 
performance levels but different architectures might help, 
but our experiments did not verify this. In addition, 
combining predictions from the spatial and the temporal 
streams is useful. This is important for the classification 
of long-term procedural actions, which benefit significantly 
from temporal clues. The linearly weighted spatial-temporal 
fusion method works well but is not ideal. This 
aspect deserves further investigation. 

We also observe that a moderate dropout ratio (0.5 for 
spatial fine-tuning and 0.7 for temporal training) is consistently 
good. Large dropout ratios like 0.9 may be unsuited 
to small networks with fewer layers, particularly under the 
fine-tuning setting that adjusts only the last several layers. 
Finally, we find that softmax seems to be the better choice, with 
consistently good results. 

Deep learning based approaches are already showing better 
results than traditional techniques for video classification. 
We believe that the room for further improvement 
is huge. First, the temporal stream results might be significantly 
boosted if we had sufficient labeled video data 
to train a deeper network. Second, although the two-stream 
framework is adopted in this work, future approaches do 
not necessarily need to follow this pipeline. After all, the key 
is how to model the temporal dimension of the 
videos, which can be achieved by using alternative solutions 
like RNNs or by devising new network architectures. 